feat(celery Wave 5): single-PR close-out 16 backlog items (5-phase chunked-rotation)#1733
Merged
Conversation
…locate (legacy graphindex elimination first chunk) Wave 5 task #26 first chunk per architect msg=10b5fae6 §K.9 spec amendment + PM msg=af82797e dispatch (single Wave 5 PR with 5-phase chunked-rotation commits, big-PR fast-landing per earayu2 msg=eced858d). Two changes bundled (per PM msg=af82797e "Phase 1 first commit = §K.9 spec paste + legacy graphindex elimination 实现起步"): 1. **§K.9 Wave 5 spec section** added to design pack: - 5-phase commit roadmap (16 acceptance items) - production-readiness 三类 layer per phase - architect direct ratify lane scope per phase - pre-check pattern lock per phase (per ``feedback_spec_lock_grep_verify_caller.md`` 双 pattern) - Layer 1 same-session continuation directive (per ``feedback_no_refresh_complete_all_tasks.md``) - Wave 5 acceptance summary (16 items + owner table) 2. **`aperag/indexing/llm.py` relocate** (§K.9.1 acceptance item 3): - new module owns canonical ``build_collection_llm_callable`` + ``render_extraction_prompt`` + ``ENTITY_RELATION_EXTRACTION`` template + ``LLMCall`` type alias - legacy ``aperag/domains/knowledge_graph/graphindex/integration.py`` and ``prompts.py`` re-export from new location during Wave 5 deprecation window so legacy retrieval/curation callers keep working until Phase 1 close-out - ``aperag/indexing/graph_extractor.py`` updated to import from new ``aperag.indexing.llm`` module directly (instead of legacy graphindex bridge) - test ``test_graph_extractor.py`` monkey-patch sites updated to target ``aperag.indexing.llm`` module Pre-check pattern 1 grep-verify caller cascade (per architect msg= b26f64b2 chunk 4d Option C ruling, used as reference list for Phase 1 migration scope): - ``aperag/indexing/`` → 0 cross-references to legacy ``graphindex/storage/*`` (chunk 4d narrowed scope invariant maintained ✅) - legacy ``graphindex/integration.py`` and ``prompts.py`` callers: ``retrieval/pipeline.py:85`` / ``knowledge_graph/service.py:69+`` / ``graph_curation/service.py:37`` / ``graph_curation/integration.py:21`` / ``service/prompt_template_service.py:161`` — remaining migrations land in subsequent Phase 1 commits Local gates green: - ``ruff check ./aperag ./tests`` clean - ``ruff format --check ./aperag ./tests`` clean (509 files) - ``pytest test_graph_extractor.py + test_full_indexing_pipeline.py + test_worker_factory.py + tests/unit_test/indexing/``: 179 passed / 2 skipped (Layer 2 stubs) Production-readiness three-class layer (Phase 1 partial): - must-be-real: ``aperag.indexing.llm`` module is the canonical production home for the relocated LLM helpers (no import indirection through legacy package once retrieval/curation migration completes) - may-be-gated: legacy ``graphindex/{integration,prompts}.py`` re-export shims kept active during Wave 5 deprecation window so cross-cutting refactor can land incrementally without breaking legacy retrieval/curation flow - fully-resolves: §K.9.1 Wave 5 acceptance item 3 (`aperag/indexing/llm.py` relocate); items 1 (legacy graphindex package elimination) and rest of Phase 1 cascade migration land in subsequent commits Next Phase 1 commits: migrate ``retrieval/pipeline.py`` / ``knowledge_graph/service.py`` / ``graph_curation/*`` to §G.5 read primitives; once all callers migrated → delete legacy ``graphindex/{storage, service, integration, engine, __init__}.py`` + delete legacy tests; final commit verifies grep-zero invariant. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ON import to aperag/indexing/llm Wave 5 task #26 chunk 2 per §K.9.1 acceptance item 3 cascade (legacy graphindex caller migration): `aperag/service/prompt_template_service.py:155-163` (the `get_default_prompt(prompt_type="graph")` branch) was importing the ENTITY_RELATION_EXTRACTION template from the legacy `aperag.domains.knowledge_graph.graphindex.prompts` module. Per chunk 1 relocate (`11113acb`), the canonical home for the template is now `aperag.indexing.llm`. This commit migrates the import site so the caller no longer transitively touches the legacy module. Behavior unchanged — the legacy `graphindex/prompts.py` shim already re-exports the template from the new location, so this is a pure import-path refactor that leaves the runtime payload identical. Local gates: - `ruff check ./aperag ./tests` clean - `ruff format --check ./aperag ./tests` clean (509 files) Phase 1 cascade progress: - chunk 1 ✅ (`11113acb`): §K.9 spec + `aperag/indexing/llm.py` relocate - **chunk 2** (this commit): `service/prompt_template_service.py` import relocate - chunk 3+ pending: `retrieval/pipeline.py` / `knowledge_graph/service.py` / `graph_curation/*` migration (these need new design — legacy 24-method `GraphStore` Protocol → new 10-method `LineageGraphStore` Protocol API surface is NOT a 1-to-1 replacement; per Bryce msg=30c7e994 scope reality check + architect msg=b052b1b4 cascade ratify lane lock) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ent/intentional split + parse_version short-circuit + reconciler stuck-parse re-enqueue Wave 5 Phase 4 (PR-C / task #29) — closes 3 latent issues surfaced through Wave 4 ratify trail: * **P4-1 cleanup transient-vs-intentional split** (per architect msg=6aa8ca88 T2 ratify minor obs A): Pre-Wave-5 ``_resolve_cleanup_worker`` collapsed any factory exception into "drop the DB row". Transient infra errors (Qdrant blip, ES network glitch) lost the retry signal — DB row was deleted before next cycle could retry. Wave 5 P4 returns a new ``CleanupWorkerResolution(worker, transient)`` distinguishing the two: ``WorkerFactoryError`` is a by-design gate (drop the row to bound index growth), any other Exception is transient (skip DB row drop so next cycle retries when backend recovers). Counts surface ``transient_deferred`` so operators can track recovery rate. * **P4-2 parse_version short-circuit** (per huangheng T3 chunk 2 obs B + Wave 5 backlog item): Pre-Wave-5 ``parse_document`` always re-runs DocParser even when the resulting artifact directory is byte-identical (parse_version is content-derived). Wave 5 P4 adds ``short_circuit_if_artifacts_exist=True`` default — if all three derived artifacts (markdown.md / outline.json / chunks.jsonl) already exist under the canonical ``derived/parse_<version>/`` path, skip DocParser + writes entirely. Eliminates the ~30s OCR / Word rerun cost on rebuild of unchanged content. Tests can pass ``False`` to force re-parse. * **P4-3 reconciler stuck-document parse re-enqueue** (per architect Wave 4 T3 chunk 2 obs A — production gap close): Pre-Wave-5 parse failures (DocParser raise / source missing) silently dropped the document_id in the parse worker; operator saw ``document.status == PENDING`` forever with no signal to re-trigger. Wave 5 P4 adds ``reconcile_stuck_documents_for_parse_reenqueue`` to the reconciler loop. Detects documents with ``Document.gmt_created < now - cooldown_seconds`` AND zero ``document_index`` rows AND ``Document.gmt_updated < now - cooldown_seconds`` (cooldown filter prevents 30-s tick storm), then pushes a fresh ``ParseDispatchPayload`` matching the upload handler's contract. ``gmt_updated`` bumps after each push so the cooldown predicate rate-limits re-enqueue. Three-class tag (Wave 3 production-readiness invariant): * must-be-real: cleanup transient/intentional split prevents silent retry-signal loss; parse_version short-circuit eliminates OCR rerun waste; stuck-parse reconciler closes the document.status PENDING-forever gap * may-be-gated: ``short_circuit_if_artifacts_exist=False`` for tests pinning DocParser invocation count * fully-resolves: 3 of Wave 5 P1 backlog items (per architect ratify trail msg=6aa8ca88 + huangheng obs trail) Deltas: * ``aperag/indexing/cleanup.py`` — ``CleanupWorkerResolution`` dataclass + WorkerFactoryError-vs-Exception split in ``_resolve_cleanup_worker``; both ``cleanup_orphan_parse_versions`` + ``cleanup_for_deleted_documents`` honor ``resolution.transient`` to skip DB row drop on transient infra failures; collection cascade aggregates ``transient_deferred``. * ``aperag/indexing/parser.py`` — ``_all_artifacts_present`` predicate + ``short_circuit_if_artifacts_exist`` parameter + early-return ParseResult when all 3 artifacts present. * ``aperag/indexing/reconciler.py`` — ``STUCK_PARSE_COOLDOWN_SECONDS`` + ``reconcile_stuck_documents_for_parse_reenqueue`` async scan + ``_select_stuck_documents_for_reenqueue`` SQL query + ``_build_parse_payload_for_document`` payload reconstruction + ``_resolve_collection_parser_config`` / ``_resolve_collection_modalities`` (mirror ``document_service`` shape) + ``_mark_stuck_documents_reenqueued`` bumps gmt_updated; ``run_reconcile_loop`` calls the new scan. * ``aperag/indexing/__init__.py`` — re-export new symbols. * ``tests/integration/test_p4_robustness_3pack.py`` (new) — 13 tests across 3 layers (cleanup transient/intentional split / parse short-circuit / reconciler re-enqueue with cooldown + payload shape + skip-when-indexed + skip-when-no-object-path). * ``tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py`` — added ``transient_deferred: 0`` to expected empty-counts dict. Local gates: * ``pytest tests/unit_test/indexing/ tests/integration/`` — 232 passed / 48 skipped (incl. 13 new ``test_p4_robustness_3pack``). * ``ruff check aperag/ tests/integration/test_p4_robustness_3pack.py`` — clean. * ``ruff format`` — applied. Branch: local ``chenyexuan/celery-wave5-p4`` based on main ``19d3d70`` (Wave 4 PR #1731 squash); will rebase onto ``bryce/celery-wave5`` once Bryce opens the Wave 5 draft PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e 6 backlog
Per architect Option (a) ruling 2026-04-27 (post-cascade scope check
msg=30c7e994 + ratify msg=8780c937):
* §K.9 Phase 1 narrowed scope: relocate-only (chunks 1+2 already
shipped via `11113acb` + `e0707165`). Original Phase 1 scope (delete
legacy `graphindex` package + migrate retrieval/curation callers)
collided with unresolved design gap — legacy `GraphIndexService.
query_context()` 24-method LightRAG-style API has no equivalent on
the new 10-method `LineageGraphStore` Protocol; building that query
layer is 1-2 week design + implementation effort, not in-scope for
Wave 5 single-PR fast-landing style.
* §K.10 Wave 6 backlog added: full legacy graphindex elimination +
retrieval/curation migration + new LightRAG-style query layer
design moves to Wave 6 separate PR with its own architect ratify
lane (per chunk 4d Option C ruling msg=b26f64b2 + Wave 4 §K.8
precedent).
* Wave 5 acceptance summary updated: 15/16 items in Wave 5 PR
(item 1 legacy graphindex elimination → Wave 6).
* Phase 1 close-out signal: chunk 1 (`11113acb`) +chunk 2 (`e0707165`)
ship llm.py relocate + import re-route to new canonical home;
legacy `graphindex/{integration,prompts}.py` keep deprecation-shim
re-exports during the Wave 6 deprecation window so legacy
retrieval/curation callers do not break mid-Wave-5.
This commit closes Phase 1 narrowed scope and unblocks Phase 2 (T7
multimodal vision-LLM 3-item bundle) per architect ruling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 5 Phase 3 (task #28) — replaces the pre-Wave-5 ``pytest.skip`` stubs in ``test_full_indexing_pipeline.py`` Layer 2 with functional test bodies that the e2e-http-compose CI lane can execute against the live backend stack. Per architect msg=fdd53586 + chunk 4d+4e ratify msg=c279a0ff — Layer 2 contract was declared but skip-stub-only; this phase wires the actual implementation so the canonical Phase 1 invariant runs end-to-end whenever the operator opts in via ``RUN_E2E_PHASE1_SMOKE=1``. Two test bodies wired: * ``test_phase1_full_pipeline_vector_fulltext_summary_active_graph_vision_failed`` — the canonical Phase 1 smoke (architect msg=da3012a4): real ``ProductionWorkerFactory`` + parse_document + dispatch_indexing + worker pool drive until every modality finalises. Asserts vector + fulltext + summary reach ACTIVE; graph + vision finalise FAILED with a gate marker (``Wave 4 wiring`` chunk 4b / ``completion model`` post-T1 self-disable / ``multimodal`` vision gate). The gate-marker OR tolerates the same-Wave T1 self-disable surface so the test stays green across the chunk 4b → T1 closure. * ``test_phase1_multi_keyword_fulltext_search_returns_hits`` — sweep D verification (architect msg=fdd53586): index a real document via the worker pool, then issue a 3-keyword query through ``_fulltext_search`` and assert ≥1 hit. The retrieval- side ``minimum_should_match`` arithmetic over N×content + N×title should-clauses (huangheng msg=fb64468c flag) is the latent issue this exercises end-to-end. New fixture machinery: * ``_phase1_e2e_skip_reason()`` — central skip-reason resolver that documents the activation contract: ``RUN_E2E_PHASE1_SMOKE=1`` + ``PHASE1_E2E_COLLECTION_ID`` (pointing at a Collection seeded by the e2e-http-compose bootstrap with a real model provider configured) + backend env vars (``DATABASE_URL`` / ``ES_HOST`` / Qdrant + Redis env vars). Both Layer 2 tests share the same skip predicate so the lane runner only needs one env-var setup. * ``_resolve_phase1_e2e_engine()`` — opens a real Postgres engine using ``settings.database_url``; skips with a clear reason if unset. * ``_run_phase1_workers_until_quiet()`` — bounded async loop that drives ``ProductionWorkerFactory`` directly (mirroring the ``run_*_worker`` lifespan path) until every modality row reaches a terminal status. Bounded by ``timeout_seconds`` so a hung modality fails the test loud rather than blocking forever. Three-class tag (Wave 3 production-readiness invariant): * must-be-real: live ProductionWorkerFactory + real backends + real ES query * may-be-gated: skips when env vars missing (local-dev never requires the full stack) * fully-resolves: 2 of Wave 5 P1 backlog items (Layer 2 stubs activated for canonical full pipeline + sweep D multi-keyword smoke) Cleanup roundtrip (delete document → cleanup loop → backend artefacts gone) is left as a follow-up sub-test to land alongside the e2e-http-compose lane document-delete API access scaffolding. Local gates: * ``pytest tests/integration/test_full_indexing_pipeline.py`` — 4 passed / 2 skipped (Layer 2 skips cleanly with the new ``_phase1_e2e_skip_reason()`` until env vars set in CI lane). * ``pytest tests/unit_test/indexing/ tests/integration/`` — 232 passed / 48 skipped — no regression vs P4 baseline. * ``ruff check aperag/ tests/integration/test_full_indexing_pipeline.py`` — clean. * ``ruff format`` — applied. Branch: ``chenyexuan/celery-wave5-p4`` rebased on ``bryce/celery-wave5`` HEAD ``99af7965`` (Phase 1 chunk 3 spec amend) → push to ``bryce/celery-wave5`` so Wave 5 single-PR pile continues. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ace (T7 item 1) Wave 5 task #27 chunk 1 per §G.2.5.1 amendment item 1: ``EmbeddingService.embed_image(image_bytes, alt_text)`` API surface is the canonical multimodal embedding entry point that replaces the Wave 3 placeholder ``embed_query(f"{image_id}|{alt_text}")`` string- concat (Wave 3 lesson #10 broken pattern). Behavior: * When ``self.multimodal=True`` (operator wired a multimodal embedder on the collection's spec), encode image bytes as base64 data URL + forward to LiteLLM ``embedding(input=[{"image_url":{"url":...}}, ...])`` shape that multimodal-capable providers (Voyage / Jina v3 / OpenAI multimodal / etc.) accept natively. * When ``self.multimodal=False``, raise ``EmbeddingError`` with clear operator-facing diagnostic — chunk 4b vision gate already prevents this state but runtime check is defense-in-depth (Wave 3 lesson #10 ship-incomplete-but-don't-silently-lie). * ``alt_text`` paired into LiteLLM input as a textual hint for embedders that accept multi-part inputs; image-only embedders silently drop the text element. The ``aembed_image`` async wrapper mirrors the ``aembed_query`` pattern (``asyncio.to_thread`` since the LiteLLM call is sync). Wave 5 P2 chunk-1 scope (this commit): * declares the API surface so Phase 2 chunks 2 (parser image extraction) + 3 (provider v3 capability flag UI) and the chunk 4b vision gate self-disable path have a concrete contract to wire to * uses imghdr-based MIME detection + LiteLLM's documented multimodal input shape; provider-specific input format variations defer to Wave 6 (per §K.10 Wave 6 backlog cross-cutting refactor) Local gates green: - ``ruff check ./aperag ./tests`` clean - ``ruff format --check ./aperag ./tests`` clean (510 files) - ``pytest`` 179 passed / 2 skipped — no regressions on existing EmbeddingService callers (text-only path unchanged) Production-readiness three-class layer: - must-be-real: real LiteLLM multimodal embedding call against the operator-configured provider when ``multimodal=True`` - may-be-gated: provider-specific input-format adjustments (Voyage vs Jina vs OpenAI multimodal differences) deferred to Wave 6 - fully-resolves: §G.2.5.1 item 1 (``embed_image`` API surface) + §K.9 Phase 2 partial — items 2+3 (parser image extraction + provider v3 capability flag UI) follow in subsequent commits Next Phase 2 chunks: - chunk 2: parser image extraction → ``derived/parse_<v>/vision/images/<image_id>.<ext>`` - chunk 3: provider v3 router multimodal capability flag UI exposure - chunk 4: ``worker_factory._build_vision_worker`` ``_embed`` callsite rewrite to use ``embed_image`` + chunk 4b vision gate self-disable verify (Phase 1 Layer 1 test rename) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ractor caps + timeout
Per huangheng T1 obs A msg=6b349693: surface the LightRAG-style
extractor's per-chunk caps and LLM timeout as collection-level
overrides (`KnowledgeGraphConfig.{max_entities_per_chunk,
max_relations_per_chunk, per_chunk_timeout_seconds}`) so deployments
tuning a slow LLM provider or extracting from entity-dense documents
can lift the defaults without patching `aperag/indexing/graph_extractor.py`
constants.
* `aperag/schema/common.py`: 3 new optional fields on KnowledgeGraphConfig.
* `aperag/indexing/graph_extractor.py`: `_resolve_int_kg_config` /
`_resolve_float_kg_config` helpers (mirror `_resolve_entity_types`
pydantic-attr / Mapping / JSON-string tolerance pattern + reject
non-positive / non-numeric values with warning + default fallback);
`_extract_one_chunk` now takes `timeout_seconds` kwarg and
`asyncio.wait_for` wires it instead of the global constant.
* `tests/integration/test_graph_extractor.py`: 6 new unit tests pinning
override-wins / fallback-on-missing / fallback-on-non-positive /
int-coerced-to-float / fallback-on-garbage; fixed pre-existing
`_extract_one_chunk` call to pass `timeout_seconds=60.0`.
Defaults unchanged (32 / 32 / 60.0); collections that don't set the
new fields keep current behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… forward-compat + OTLP config cross-check + cleanup builder share Wave 5 Phase 5B (task #31) — 4 polish items from the Wave 4 ratify trail accumulated obs (chunk_id schema unify deferred to a follow-up once parser canonical schema lock lands): * **P5B-A utc_now → CURRENT_TIMESTAMP unify** (per huangheng chunk 4a obs A): ``_LineageEntityRow`` + ``_LineageRelationRow`` ``gmt_created`` / ``gmt_updated`` columns now use ``server_default=text("CURRENT_TIMESTAMP")`` mirroring the alembic migration declaration. Pre-Wave-5 ORM used ``default=utc_now`` Python-side; alembic check passed because Postgres treats both as semantic-equivalent but the per-mirror discipline is stronger when both layers speak the same dialect (schema-touching follow-ups cannot drift undetected). * **P5B-B tenant_scope_key org-prefix forward-compat** (per T3 chunk 2 obs C + §H.2 forward-compat lock): new ``aperag.indexing.parse_orchestrator.resolve_tenant_scope_key`` helper centralises the prefix construction. Pre-Wave-5 the prefix was hard-coded as ``f"user:{document.user}"`` at every callsite (upload handler, reconciler stuck-doc re-enqueue); the helper unifies the construction so a future Wave 6/7 organisation-tenant rollout flips one place. The org branch stays inert for Wave 5 (Document/Collection schemas don't carry org_id yet) — the helper just makes the seam explicit. * **P5B-C OTLP config cross-check** (per huangheng T6 chunk 1 obs): lifespan startup logs a clear warning when ``INDEXING_METRICS_EMITTER=otlp`` but ``APERAG_OBSERVABILITY_MODE`` is not set to ``otlp`` / ``collector``. Pre-Wave-5 this combination silently produced an ``OTLPMetricsEmitter`` whose ``MeterProvider`` was never installed by ``aperag.observability.metrics.init_metrics_provider`` — samples no-op'd silently, defeating the explicit ``INDEXING_METRICS_EMITTER=otlp`` opt-in. * **P5B-D cleanup builder share helper** (per huangheng T2 obs B drift risk): new ``_build_collection_qdrant_connector(collection, *, allow_vector_size_fallback)`` shared helper used by ``_build_vector_worker`` / ``_build_summary_worker`` (dispatch, ``allow_vector_size_fallback=False``) AND ``_build_qdrant_cleanup_backend`` (cleanup, ``allow_vector_size_fallback=True``). Pre-Wave-5 dispatch and cleanup paths duplicated the embedder + connector wiring; a future Qdrant adapter signature change had to be applied twice. ``_build_vision_worker`` keeps its pre-helper shape — its ``is_multimodal()`` gate must fire BEFORE any network call to preserve the Wave 4 chunk 4b "fail fast on non-multimodal embedder" invariant the existing test pins. Three-class tag (Wave 3 production-readiness invariant): * must-be-real: 4 polish items each tighten a real production seam (alembic mirror discipline / org-prefix forward-compat / OTLP config consistency / cleanup-vs-dispatch drift surface) * may-be-gated: org-prefix branch stays inert until org_id columns land (declared in helper docstring) * fully-resolves: 4 of the 5 P5B chenyexuan-batch backlog items. ``chunk_id`` schema unify is left to a follow-up — parser canonical schema lock first needs the §C.x amend; touching the ``or chunk.get("id")`` fallback now without that lock would introduce drift in the opposite direction. Deltas: * ``aperag/indexing/graph_storage/postgres.py`` — ``_LineageEntityRow`` + ``_LineageRelationRow`` switched to ``server_default=CURRENT_TIMESTAMP``; ``utc_now`` import dropped. * ``aperag/indexing/parse_orchestrator.py`` — ``resolve_tenant_scope_key(*, document, collection)`` helper + re-export. * ``aperag/domains/knowledge_base/service/document_service.py`` — ``_create_or_update_document_indexes`` calls ``resolve_tenant_scope_key`` in place of the inline ``f"user:{document.user}"``. * ``aperag/indexing/reconciler.py`` — ``_build_parse_payload_for_document`` calls the same helper for the stuck-document re-enqueue producer (consistency with upload handler). * ``aperag/app.py`` — lifespan logs the OTLP-vs-observability-mode cross-check warning before constructing ``OTLPMetricsEmitter``. * ``aperag/indexing/worker_factory.py`` — ``_build_collection_qdrant_connector`` shared helper + ``_build_vector_worker`` / ``_build_summary_worker`` / ``_build_qdrant_cleanup_backend`` reuse it; vision keeps its multimodal-gate-first shape. Local gates: * ``pytest tests/unit_test/indexing/ tests/integration/`` — 232 passed / 48 skipped (no regression). * ``ruff check aperag/ tests/integration/`` — clean. * ``ruff format`` — applied. Branch: ``chenyexuan/celery-wave5-p4`` rebased on ``bryce/celery-wave5`` HEAD ``4f0e9b0`` (P2 chunk 1 embed_image API surface) → push to ``bryce/celery-wave5`` for Wave 5 single-PR pile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per §K.9.1 P5A item 14 + design pack §K.9 Phase 5A item 4: deployments that share a Neo4j instance with user-owned graphs would collide on the generic `LineageEntity` / `LineageRelation` labels. Bump them to `aperag_LineageEntity` / `aperag_LineageRelation` and align constraint names (`aperag_lineage_entity_collection_name_unique` / `aperag_lineage_relation_collection_triple_unique`). Hard-cut second round (no data migration): Wave 4 graph indexing defaulted `enable_knowledge_graph=False`, so no production deployment has data on the legacy unprefixed labels. Postgres / Nebula backends are unaffected (table names already namespaced via `aperag_lineage_*` schema in alembic migration `e7a3b9c2d1f6`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…h explicit reasoning Per Wave 5 P5A scope close-out 2026-04-27: * Item 2 (chunk 4b _no_op_extractor identity check → attribute marker) is N/A: the placeholder was deleted in Wave 4 T1 (`19d3d70f`) when `build_collection_graph_extractor` replaced `_no_op_extractor`. There is no surviving identity-check site to harden — moot in current code. * Item 3 (W5-perf-graph-lineage parallel-list O(N) cross-backend) is deferred to Wave 6: Postgres needs JSONB → text[] migration, Nebula needs tag re-model, plus alembic column-shape change. >10k docs/entity is not a Wave 5 acceptance criterion; trigger when observed in real deployments. * Item 5 (Cypher type keyword rename `n.type` → `entity_type` / `relation_type`) is deferred to Wave 6: cross-backend rename (Postgres column + alembic; Cypher property rename; Nebula tag-prop rename) plus §D.3 Protocol-surface change (`EntityRecord.type` → `EntityRecord.entity_type` etc.). Cypher `TYPE()` is technically only for relationships so current code is unambiguous — forward-compat hygiene rather than correctness fix. §K.9 Phase 5A acceptance lines amended to reflect ship status (items 1+4 shipped; 2 N/A; 3+5 → Wave 6). §K.10 Wave 6 backlog extended with items 6+7 covering the deferred cross-backend rewrites. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…se_<v>/vision/{images,source.jsonl}
Per §G.2.5.1 spec amend item 2: when DocParser produces `AssetBinPart`
payloads (PDF page images, single-image-input passthrough, data-URI
extracted images), the parser writes each image blob to
`derived/parse_<v>/vision/images/<image_id>.<ext>` and lands a
`vision/source.jsonl` descriptor enumerating them. The vision worker
(chunk 4) consumes the descriptor instead of the T1 simulator's
synthetic `images.json` companion.
`aperag/indexing/parser.py`:
* `_docparser_extract_markdown` now returns
`tuple[str, list[_VisionImageAsset]]` — markdown body unchanged plus
the extracted image assets list. Non-image `AssetBinPart`s (audio /
PDF data) drop. Duplicate `asset_id`s deduplicated to keep
Qdrant-point-id stable.
* `_VisionImageAsset` carrier holds (`image_id`, `data`, `mime_type`,
`alt_text`, `page_idx`, `bbox`).
* `_vision_image_extension(mime_type)` maps known image MIMEs to a
filename extension; falls back to `.bin` for unknown types since
vision worker uses `imghdr` on bytes at embed time.
* `_write_vision_assets` persists each blob via `write_atomic` and
writes the JSONL descriptor.
* `ParseResult.vision_source_path` and `vision_image_count` are new
optional fields (defaults `""` / `0`) so pre-Wave-5 callers see no
behaviour change. Image-only inputs (no markdown emitted) still land
their assets so vision modality has bytes to embed.
`tests/unit_test/indexing/test_parser_image_extraction.py`: 9 unit tests
covering MIME → extension mapping, descriptor schema, simulator-path
no-op, image-only input asset persistence, unknown-MIME `.bin`
fallback, and the new `ParseResult` fields. Tests monkeypatch
`_docparser_extract_markdown` to avoid pulling MinerU / MarkItDown
into the unit-test path.
Production-readiness 三类:
- must-be-real: real `AssetBinPart` extraction wired through DocParser
- may-be-gated: descriptor + image artefacts only land when DocParser
surfaces `AssetBinPart`s (image-less docs see no vision/ writes)
- fully-resolves: §G.2.5.1 spec item 2 (parser image extraction) +
Wave 5 P2 chunk 2 acceptance
The vision worker callsite rewrite (chunk 4) and provider v3 multimodal
flag UI (chunk 3) remain. Defaults preserve existing behaviour for
text-only callers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…apability flag Per §G.2.5.1 spec amend item 3: surface a typed capability flag for embedding models that accept image bytes (CLIP / Voyage Multimodal / Jina v3 / OpenAI multimodal embeddings) so operators can register them via the v3 model platform UI. The flag drives the chunk 4b vision gate's `EmbeddingService.is_multimodal()` runtime check — flip it on the collection's embedder spec model and the gate self-disables, enabling vision modality. Distinct from existing `supports_vision` (chat/completion models that accept image input) — `supports_multimodal_embedding` describes embedding models that produce vectors from images. Changes: * `aperag/domains/model_platform/schemas.py`: new optional `supports_multimodal_embedding: bool = False` on `Model` / `ModelCreate` / `ModelUpdate` (default False, no behaviour change for existing rows). * `aperag/domains/model_platform/db/models.py`: new `supports_multimodal_embedding` Boolean column with default False. * `aperag/migration/versions/...f1c8d2a5b6e3...`: alembic migration adding the column with `server_default=false`. Forward-only cutover safe because pre-Wave-5 rows default to False (matches prior implicit behaviour). * `aperag/domains/model_platform/service/model_service.py`: `_model_to_schema` maps the new column. * `aperag/db/repositories/llm_provider.py`: `create_model` accepts the new field so the v3 `POST /models` flow persists it. * `aperag/llm/embed/base_embedding.py`: `get_embedding_service` prefers the typed `supports_multimodal_embedding` column; falls back to the legacy `runner_config["multimodal"]` JSON entry so pre-Wave-5 operators who edited the JSON keep working without re-saving the row. Tests: 3 new unit tests in `test_model_platform_v3_contract.py` covering default-False behaviour, opt-in True path, and v3 OpenAPI schema exposure on Model / ModelCreate / ModelUpdate. Production-readiness 三类: - must-be-real: real DB column + alembic migration + v3 API surface - may-be-gated: legacy `runner_config["multimodal"]` fallback for rows created before the typed column landed - fully-resolves: §G.2.5.1 spec item 3 (provider v3 multimodal capability flag UI) + Wave 5 P2 chunk 3 acceptance Phase 2 chunk 4 (callsite rewrite + chunk 4b gate self-disable verify + Layer 1/2 test rename) remains as the final Wave 5 blocker. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…te self-disable verify Per §G.2.5.1 spec amend final piece: rewire `_build_vision_worker._embed` to call `EmbeddingService.embed_image(image_bytes, alt_text)` (chunk 1) with the actual image bytes the parser persisted (chunk 2 wrote `derived/parse_<v>/vision/images/<image_id>.<ext>` + a JSONL descriptor at `vision/source.jsonl`). The chunk 4b vision gate self-disables when the operator flips `Model.supports_multimodal_embedding=True` (chunk 3). `aperag/indexing/worker_factory.py`: * `_embed(image_id, alt_text, image_bytes=None)` — when image_bytes is provided, route to `embedding_service.embed_image(image_bytes, alt_text)`. None falls back to the legacy text-concat path so the T1 simulator + tests that hand the worker synthetic JSON keep working. * Gate-raise message reframed: drops "Wave 4 wiring" phrasing (now Wave 5 wiring is land), names the typed `Model.supports_multimodal_embedding` flag so an operator can fix the config directly. `aperag/indexing/vision.py`: * `VisionModality.derive` accepts both source-path formats: the legacy single-JSON-array shape (T1 simulator / pre-Wave-5 tests) AND the new JSONL-with-image-path shape (parser chunk 2 output). Format detection is by first non-whitespace byte (`[` → JSON array, else → JSONL). * `_load_image_bytes(record)` reads the descriptor's `image_path` through the object store; missing blob logs a warning and returns None (embedder still runs on the alt-text/id placeholder digest) so a partial parser write doesn't block the whole derive cycle. * `_placeholder_embedding(..., image_bytes=None)` mirrors the new embedder signature; placeholder ignores bytes. * Embedder Protocol is widened: `(image_id, alt_text, image_bytes=None)`. `tests/integration/test_full_indexing_pipeline.py`: * Renamed Layer 1 `test_phase1_vision_modality_raises_wave4_wiring_gate` → `..._gate_raises_when_embedder_not_multimodal`. Asserts the reframed message names `multimodal embedding model` + `supports_multimodal_embedding` flag. * New positive-path Layer 1 `test_phase1_vision_modality_gate_self_disables_when_embedder_multimodal`: with `is_multimodal()=True`, the factory builds a vision worker without raising — pins the chunk 4 gate-self-disable contract. * Layer 2 e2e assertion: vision modality may be ACTIVE (when CI fixture has multimodal embedder configured) OR FAILED with a gate marker including `supports_multimodal_embedding`. OR-on-marker tolerance kept for transition state. `tests/unit_test/indexing/test_t1_4_summary_vision.py`: 3 new tests covering JSONL descriptor with image_path (bytes loaded + forwarded), missing-blob graceful fallback, and legacy JSON-array backward compat. Production-readiness 三类: - must-be-real: real `embed_image` callsite + real bytes load from parser's descriptor - may-be-gated: legacy text-concat fallback + simulator JSON format preserved for tests - fully-resolves: §G.2.5.1 spec items 1+2+3 all wired end-to-end + chunk 4b vision gate self-disable verify (Wave 5 P2 closure) Wave 5 P2 (T7 multimodal vision-LLM 3-item bundle) closed. The chunk 4b vision gate self-disables when an operator configures a multimodal embedder; default-off behaviour preserved for text-only collections. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fdad09e to
8e7c6ec
Compare
8 tasks
earayu
added a commit
that referenced
this pull request
Apr 27, 2026
…cation cache (#1735) Wave 5 P2 chunk 4 (#1733) landed the canonical multimodal embedding API surface (`EmbeddingService.embed_image(image_bytes, alt_text)`) but left the call un-cached — every vision modality embed now hits the LiteLLM provider, even for the same image. This wires it into the canonical `aperag.cache.NAMESPACE_EMBEDDING` infra (PR #1734) mirroring the existing text `_embed_batch` pattern. Cache key shape (per `aperag/cache/README.md` no-raw-bytes policy): { "kind": "image", "provider": ..., "model": ..., "api_base": ..., "api_key_hash": sha256(api_key), "file_hash": sha256(image_bytes), "alt_text": ..., "multimodal": True, } Image bytes are identified by their sha256 hex digest so the Redis key stays bounded; alt_text is part of the key because providers that accept paired text+image inputs return a different vector when the textual hint changes (alt_text="" collapses to one key for image-only callers). Tests ----- New `tests/unit_test/llm/test_embed_image_cache.py` (7 tests): * identical (bytes, alt_text) → second call hits cache (no upstream) * same bytes + different alt_text → distinct keys, both compute * different bytes + same alt_text → distinct keys, both compute * key shape uses sha256 file_hash, raw bytes never appear in key * `caching=False` bypasses cache (always upstream) * `multimodal=False` raises EmbeddingError (defense-in-depth) * empty image_bytes raises EmptyTextError Full unit suite: 1022 passed, 29 skipped, ruff + format clean. Out of scope (per task #37 boundary) ------------------------------------ Provider-specific multimodal embedder format variations (Voyage / Jina v3 / OpenAI multimodal SDK input shapes) stay on task #39 per PM dispatch + simple-stable directive (`feedback_simple_stable_zero _maintenance.md`).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wave 5 indexing redesign close-out — 16 backlog items in single PR with 5-phase chunked-rotation commits (per architect msg=11442bbf reframe + earayu2 msg=eced858d "大PR 快速落地" directive). Same big-PR pattern as Wave 4 PR #1731.
§K.9 spec amendment locked + Phase 1 chunk 1 (relocate
build_collection_llm_callable+render_extraction_prompt+ENTITY_RELATION_EXTRACTIONtoaperag/indexing/llm.py) shipped.Phases
aperag/indexing/llm.pyrelocateTasks
aperag/indexing/llm.pyrelocategraphindex/{storage,service,integration,engine,__init__}.pyTest plan
ruff check ./aperag ./testsclean (per phase)ruff format --check ./aperag ./testsclean (per phase)aperag/indexing/*不 cross-ref legacygraphindex/storage/*embed_imageAPI +is_multimodal=Truereal wire — chunk 4b vision gate self-disable verify🤖 Generated with Claude Code